ABSTRACT

This research program was designed to investigate the 
applications of item response theory and computerized adaptive 
testing to the unique problems of the measurement of ability and the 
measurement of achievement. The research utilized a combination of 
monte carlo simulation studies and live-testing studies. The research
approach for adaptive achievement testing included: intersubtest
branching; the dimensionality of measured achievement over time;
adaptive mastery testing; and adaptive self-referenced testing. The
research approach for adaptive ability testing included: adaptive
testing strategies; and response modes, test item formats and effects
of test administration variables. This research is summarized and
related to ten technical reports and other publications. Abstracts of 
the technical reports are included and the major research findings 
are presented. (PN)

Final Report

Computerized Adaptive Measurement of Achievement and Ability 



Objectives 

This research program was designed to investigate the applications of item
response theory (IRT) and computerized adaptive testing to the unique problems 
of the measurement of ability and the measurement of achievement. Specific ob- 
jectives relevant to these two areas were as follows: 

Adaptive Achievement Testing 

1. To study the relative efficiency of various approaches to intersubtest
branching in achievement test batteries.

2. To investigate the dimensionality of measured achievement over time. 

3. To study the applicability of IRT models to the problem of mastery testing 
and to compare models for adaptive mastery testing with other approaches to 
the improvement of mastery decisions and/or reduction in test length in mas- 
tery testing. 

4. To explicate the concept of Adaptive Self-Referenced Testing and to examine 
its applicability to the achievement testing problem. 

Adaptive Ability Testing 

5. To evaluate the performance of adaptive testing strategies under conditions 
which more reasonably represent the conditions under which these strategies 
might be used, and to examine the performance of adaptive testing strategies 
in live testing. 

6. To evaluate the utility for adaptive testing of response modes and test item
formats usable in adaptive ability testing. 

Research in pursuance of these objectives began in February 1979 and continued 
through April 1983. 

Approach 

The research utilized a combination of monte carlo simulation studies and 
live-testing studies. 

Adaptive Achievement Testing 

Intersubtest branching. Intersubtest branching is an approach to the
utilization of adaptive testing methodologies in a multidimensional item pool. In
intersubtest branching, IRT item parameters are estimated separately for each 
subtest of a multisubtest battery. Using any of a number of adaptive testing 
strategies, adaptive testing occurs within the subtest based on appropriate item 
selection rules and a test termination criterion appropriate for the purpose of
testing. Upon completion of a subtest in the test battery, the final trait
level estimate (θ̂) is then used as an entry point to begin testing in a
subsequent subtest in the battery. As originally proposed, subtests in a battery
are ordered by the magnitudes of the squared multiple correlations of each
subtest with all other subtests in the battery. In this way, the entry points
for adaptive testing in each subtest utilize the information available in the
tests in the test battery that were most highly correlated with it, which should
shorten the adaptive tests for later subtests in the battery as much as possible.
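
The ordering rule described above can be sketched in a few lines. The sketch below assumes a hypothetical three-subtest intercorrelation matrix and uses the standard identity SMC_i = 1 - 1/(R^-1)_ii to obtain each squared multiple correlation from the inverse of the correlation matrix R. The function names, the example values, and the choice of descending order are illustrative assumptions and are not taken from the research reports.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity matrix.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def squared_multiple_correlations(corr):
    """SMC of each subtest with all other subtests: 1 - 1 / (R^-1)_ii."""
    inverse = invert(corr)
    return [1.0 - 1.0 / inverse[i][i] for i in range(len(corr))]


def order_subtests(names, corr):
    """Return (name, SMC) pairs, largest SMC first (ordering is an assumption)."""
    smc = squared_multiple_correlations(corr)
    return sorted(zip(names, smc), key=lambda pair: pair[1], reverse=True)


# Hypothetical intercorrelations among three subtests A, B, and C.
names = ["A", "B", "C"]
corr = [[1.0, 0.6, 0.3],
        [0.6, 1.0, 0.4],
        [0.3, 0.4, 1.0]]
ordering = order_subtests(names, corr)
```

In this hypothetical battery, subtest B correlates most strongly with the others and is therefore placed first.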

Intersubtest adaptive branching was studied by real-data simulation in
Research Report 79-6, and by monte carlo simulation in Research Report 81-4. The
study reported in Research Report 79-6 used data from conventionally-administer- 
ed tests which were analyzed as if they had been administered as an adaptive 
test, and the intersubtest branching strategy was applied to these data. This 
study was designed to separate the effects of the adaptive intrasubtest item 
selection procedure from those effects due to intersubtest branching. The study 
also (1) allowed evaluation of the effects of different intrasubtest termination 
criteria, (2) investigated the effect of taking into account errors of measure- 
ment in the multiple regression procedure used to determine test entry points, 
and (3) investigated the stability of the regression equations in cross-valida- 
tion. 

Other aspects of the intersubtest branching strategy when applied to an 
achievement test battery were investigated by monte carlo simulation in Research 
Report 81-4. Questions of interest in this study included (1) the effects of
varying subtest order, (2) the utilization of different subtest termination
criteria, and (3) the effect of variable versus fixed entry on the psychometric
properties of the intersubtest branching strategy. Dependent variables included
(1) reductions in test length, (2) effect on test information, and
(3) correlations between achievement estimates and true achievement levels. The
study design also permitted separation of the effects of intrasubtest and
intersubtest adaptive branching.

The dimensionality of measured achievement over time. The effects of
instruction on measured achievement are usually measured at a single point in
time. That is, some instruction is given to an individual and at the end of the 
period of instruction an achievement test is used to determine whether the indi- 
vidual has reached an appropriate level of achievement. On the basis of such 
information, aggregated across individuals, decisions are frequently made about 
the adequacy of instructional programs, or about the impact (or lack thereof) of 
instruction on a specific individual. 

A more powerful approach to the measurement of achievement would involve 
the use of pretests and posttests to determine if any change has occurred in 
measured achievement over time. Using change scores, however, implies that the 
variable being measured is the same at pretest as it is at posttest. There has
been very little empirical data available concerning this issue.

Research Report 81-5 was designed to investigate the question of whether
the achievement factor identified at pretest in an achievement test is the same 
factor identified at posttest. Two studies utilized data on groups of college 
students from measured achievement in mathematics classes and biology classes.
Achievement test item responses were factor analyzed prior to instruction, and
again at the end of instruction. In addition, mean differences in test scores
at pretest and posttest were analyzed. Factors obtained at pretest were
compared with those obtained at posttest to determine if the same factor was
found prior to and after instruction.

Kingsbury (1984) directly examined the characteristics of change scores
derived from adaptive and conventional tests. This study utilized data from 
college-level biology examinations. Both adaptive and conventional tests were 
administered in a complex design to groups of students in such a way that relia- 
bilities of the change scores could be determined separately for the two types 
of tests in a number of different homogeneous content areas covered in the
course. The question raised by this study was based on hypotheses that the more 
precise achievement level estimates resulting from adaptive testing should also 
result in more reliable change scores in comparison to those from conventional 
testing. Also studied was the effect of variable- versus fixed-length test ter- 
mination on the adaptive tests. 

Adaptive Mastery Testing. Adaptive mastery testing (AMT) combines IRT and
adaptive testing into an efficient strategy for making mastery or classification
decisions. In this procedure, items used to make a mastery decision are
selected by an IRT maximum information adaptive testing strategy. Item responses
are scored using a Bayesian θ estimation procedure, and a confidence or
credibility interval is computed for the θ estimate. The confidence interval
around the estimate is then compared to a mastery cutoff score, which is also
expressed on the θ metric. A mastery decision is determined on the basis of
whether the credibility interval overlaps with the mastery criterion level, and
on which side of the mastery cutoff score the individual's θ estimate falls.
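
The decision rule just described can be illustrated with a minimal sketch. It assumes a symmetric interval of the form estimate plus or minus z posterior standard deviations (z = 1.96 is an illustrative default), and it assumes that testing continues while the interval still contains the cutoff. The function name, the argument names, and the "continue" outcome are hypothetical; the report does not specify these details.

```python
def mastery_decision(theta_hat, posterior_sd, cutoff, z=1.96):
    """Classify an examinee from a Bayesian theta estimate and its interval.

    theta_hat    -- current Bayesian estimate of trait level
    posterior_sd -- posterior standard deviation of that estimate
    cutoff       -- mastery cutoff expressed on the theta metric
    z            -- interval half-width multiplier (assumed, not from the report)
    """
    lower = theta_hat - z * posterior_sd
    upper = theta_hat + z * posterior_sd
    if lower > cutoff:
        # Whole interval lies above the cutoff.
        return "mastery"
    if upper < cutoff:
        # Whole interval lies below the cutoff.
        return "nonmastery"
    # Interval still overlaps the cutoff: administer another item.
    return "continue"


# Hypothetical examinees evaluated against a cutoff of 0.0.
decisions = [
    mastery_decision(1.0, 0.3, 0.0),
    mastery_decision(-1.0, 0.3, 0.0),
    mastery_decision(0.2, 0.3, 0.0),
]
```

Under this sketch, a variable-length test ends as soon as the interval clears the cutoff, which is the mechanism by which precise estimates translate into shorter tests.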

Both monte carlo simulation and live testing were used to investigate char- 
acteristics of the AMT strategy and to compare it with other approaches for mak- 
ing mastery decisions. In Research Report 80-4 (also Kingsbury & Weiss, 1980a) 
the AMT procedure was compared to a conventionally-based mastery testing proce- 
dure and to a procedure based on Wald's sequential probability ratio test. The 
procedures were compared in terms of their efficiency, based on the test length 
required by the procedures to make a classification decision, on the validity of
the decisions made by each procedure, and on the type of classifications made by 
each of the three testing procedures. 
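
For readers unfamiliar with the sequential comparison procedure, the following is a generic sketch of Wald's sequential probability ratio test applied to right/wrong item responses. It assumes a simple binomial model in which nonmasters and masters answer correctly with fixed probabilities; those probabilities, the error rates, and the function name are illustrative assumptions, and the exact formulation used in Research Report 80-4 may differ.

```python
import math


def sprt_mastery(responses, p_nonmaster=0.5, p_master=0.8, alpha=0.05, beta=0.05):
    """Wald SPRT on a sequence of 0/1 responses (binomial model, assumed values).

    Returns (decision, number_of_items_used).
    """
    upper = math.log((1 - beta) / alpha)   # crossing this accepts "mastery"
    lower = math.log(beta / (1 - alpha))   # crossing this accepts "nonmastery"
    log_ratio = 0.0
    for n, correct in enumerate(responses, start=1):
        if correct:
            log_ratio += math.log(p_master / p_nonmaster)
        else:
            log_ratio += math.log((1 - p_master) / (1 - p_nonmaster))
        if log_ratio >= upper:
            return "mastery", n
        if log_ratio <= lower:
            return "nonmastery", n
    # Item supply exhausted before either boundary was reached.
    return "undecided", len(responses)


all_correct = sprt_mastery([1] * 10)
all_wrong = sprt_mastery([0] * 10)
```

With these assumed values, a run of correct answers reaches the mastery boundary after seven items and a run of incorrect answers reaches the nonmastery boundary after four.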

To examine the generality of the findings in live testing, in Research Re- 
port 81-3 the AMT procedure and a conventional test were administered to stu- 
dents in a biology class. Contrary to earlier studies which examined the AMT 
procedure, actual adaptive mastery tests were administered to one subgroup of 
students while the other received computer-administered conventional tests. The 
performance of the two testing strategies was evaluated in terms of a mastery 
criterion based on the students' final standing in the course, which was a
combination of their performance on course examinations and laboratory grades.

Adaptive Self-Referenced Testing. Adaptive self-referenced testing (ASRT)
is a combination of IRT and adaptive testing designed to permit the efficient 
measurement of changes in achievement levels due to exposure to instruction. 
This procedure is designed to measure individual changes in achievement in a
unidimensional item pool in a very efficient manner at a number of points of
instruction. It is thus an appropriate conceptualization for tracking
individual changes due to instruction at a number of points during a course,
since it permits an instructor to evaluate an individual's performance on a
minimum number of items at each of a number of testing occasions.

ASRT permits an instructor to measure a student early in a course, such as 
on the first day, and as frequently as is necessary during the course. Based on 
adaptive testing methodology, the data obtained from the Time 1 testing are used 
as the entry point to Time 2 adaptive test administration, and this process is
followed for any number of test administrations. In addition, test termination 
at any point in time can be based on the standard error band associated with an 
individual's θ estimate at that point in time. ASRT is designed to
simultaneously permit intraindividual measurement of change, norm-based
measurement on the θ metric which can then be converted to the
proportion-correct measurement if desired, and a mastery-based
(criterion-referenced) achievement level estimate utilizing the procedures of
AMT. While no research directly related to ASRT was done during the contract
period, the method was described in some detail in Weiss & Kingsbury (1984).
Both Research Report 81-5 and the Kingsbury (1984) study have implications for
the use of ASRT and its future development.

Adaptive Ability Testing 

Adaptive testing strategies. A major focus of this research program was on
the evaluation of different approaches to computerized adaptive testing. While 
earlier projects were concerned primarily with evaluating the relative perfor- 
mance of adaptive and conventional testing strategies, in this project the focus 
was on the IRT-based strategies and on their performance under a variety of
conditions. An overview of some aspects of project research is given by Weiss
(1982).

The performance of a Bayesian adaptive testing strategy was reported in
Research Report 83-2 (also Weiss & McBride, 1984). Owen's Bayesian adaptive
testing strategy was examined in three studies which utilized an accurate prior
θ estimate, a constant prior θ estimate with fixed test length, and a constant
prior θ estimate with variable test length. The performance of the adaptive
testing strategy was examined in terms of the bias and information of the θ
estimates as a function of θ. Also examined was the mean number of items
administered in the variable test length condition.

A major concern of the research was to evaluate the performance of adaptive
testing strategies under increasingly realistic conditions. Prior to these
studies, all studies evaluating the performance of adaptive testing strategies
did so under relatively unrealistic conditions. While characteristics of the
item pools varied in these earlier studies, the IRT item parameters used in 
these simulation studies were considered to be accurate. However, in real item 
pools, there is always some error associated with the item parameter estimates. 
Since adaptive testing is designed to select items on the basis of these item 
parameter estimates, it can be assumed that any degree of inaccuracy in the item 
parameter estimates will have detrimental effects on the performance of adaptive 
testing strategies.

Consequently, two studies were designed to investigate effects of errors in
item parameter estimates on the performance of maximum information and Bayesian
adaptive testing strategies. The first study (Crichton, 1981) assessed the
effects of errors in item parameter estimates in the context of the
three-parameter logistic model. Crichton compared the performance of the two
IRT-based adaptive testing strategies (maximum information and Bayesian) with
the stratified adaptive (stradaptive) strategy, on the hypothesis that the
stradaptive strategy should be less sensitive to errors in the item parameter
estimates. Her monte carlo simulation study varied test length from 5 to 30
items. Test length was then crossed with three levels of error in the
discrimination (a) parameter, four levels of error in estimates for the
difficulty (b) parameter, and two levels of error in the pseudo-guessing (c)
parameter. In addition to considering these effects for a, b, and c separately,
two datasets examined the effects of joint errors in the a, b, and c
parameters. Dependent variables conditional on θ included the bias, root mean
square error, inaccuracy, and information in the θ estimates, and the
correlation of θ and θ̂.
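
To make concrete why errors in item parameter estimates could matter, the sketch below shows the three-parameter logistic response function, the corresponding item information function, and a maximum information selection rule that picks the unused item with the greatest information at the current θ estimate. The scaling constant D = 1.7, the three-item pool, and all function names are illustrative assumptions rather than details drawn from the Crichton study.

```python
import math

D = 1.7  # conventional logistic scaling constant (assumed here)


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Item information for the three-parameter logistic model at theta."""
    p = p_correct(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_item(theta_hat, pool, administered):
    """Index of the unused item with maximum information at theta_hat."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))


# Hypothetical pool of (a, b, c) item parameter estimates.
pool = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
first_choice = select_item(0.0, pool, set())
second_choice = select_item(0.0, pool, {first_choice})
```

Because selection depends entirely on the estimated a, b, and c values, an item whose parameters are misestimated may be chosen at a θ level where its true information is lower than the estimates suggest.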

Mattson (1983) also examined the performance of adaptive testing strategies
under conditions of error in item parameter estimates, using monte carlo
simulation. Mattson extended the Crichton study by studying similar effects in
the one- and two-parameter logistic models, in addition to the three-parameter
model. Whereas Crichton limited her trait level estimation to maximum likelihood
scoring of the response vectors, Mattson also included Bayesian scoring of the
maximum information and Bayesian adaptive tests. In addition, Mattson allowed
the level of correlation between the a and b parameters to vary at four
different levels, as well as examining the uncorrelated condition used by
Crichton. Similar to Crichton, Mattson also varied test length from 10 to 30
items. Finally, Mattson allowed errors in the a parameter to vary at two levels,
examined four levels of error in b, and one level of error in c. All conditions
were crossed with each other. Mattson's dependent variables were the same as
those studied by Crichton.

A second factor that can affect the performance of adaptive testing strate- 
gies in a realistic item pool is the dimensionality of the item pool. Since all 
IRT models assume a unidimensional item pool, deviations from unidimensionality 
would be expected to affect the performance of adaptive testing strategies in 
real item pools, which are rarely (if ever) strictly unidimensional. As a
result, Suhadolnik and Weiss (1985) examined the robustness of adaptive testing
to multidimensionality.

In this study, the maximum information adaptive testing strategy using
maximum likelihood scoring was applied to datasets varying from strictly
unidimensional to four-factor datasets that reflected the structure of the most
multidimensional subtest of the Armed Services Vocational Aptitude Battery.
Between these extremes were two- and three-factor datasets in which the second
and third factors accounted for varying proportions of variance in comparison to
the first factor, thus simulating item structures varying from very little
multidimensionality to a very high degree of multidimensionality. A total of 45
data structures was examined.

To evaluate the effects of multidimensionality, dichotomous item responses
were simulated from the specified multidimensional structures. These item
responses were then treated as if they were derived from a unidimensional model,
and adaptive testing was implemented using the item response vectors. To
evaluate the performance of the maximum information adaptive testing strategy
under multidimensionality, the conditional bias, inaccuracy, and root mean
square error of the θ estimates were computed relative to the true first factor
θ from the multidimensional structure.
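
The conditional indices named above can be computed as in the following sketch, which groups simulated examinees by their true θ level and summarizes the estimation errors within each group. Bias and root mean square error follow their usual definitions; "inaccuracy" is taken here to be the mean absolute error, which is an assumption, since the text does not define it. The function name and the example values are hypothetical.

```python
import math
from collections import defaultdict


def conditional_indices(true_thetas, estimates):
    """Bias, inaccuracy, and RMSE of theta estimates at each true theta level."""
    errors_by_level = defaultdict(list)
    for theta, theta_hat in zip(true_thetas, estimates):
        errors_by_level[theta].append(theta_hat - theta)
    results = {}
    for theta, errors in sorted(errors_by_level.items()):
        n = len(errors)
        results[theta] = {
            "bias": sum(errors) / n,                              # signed mean error
            "inaccuracy": sum(abs(e) for e in errors) / n,        # assumed: mean |error|
            "rmse": math.sqrt(sum(e * e for e in errors) / n),
        }
    return results


# Hypothetical true levels and estimates for four simulated examinees.
indices = conditional_indices([0.0, 0.0, 1.0, 1.0], [0.5, -0.5, 1.5, 1.5])
```

In the example, errors at θ = 0 cancel to give zero bias but nonzero inaccuracy, while at θ = 1 both estimates are too high, so bias and inaccuracy coincide.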

Response modes, test item formats, and effects of test administration
variables. The administration of ability tests by interactive computers allows
the use of item types that transcend the typical dichotomously-scored
multiple-choice test item. Research Report 83-3 examined aspects of a
probabilistic response mode used in conjunction with the typical multiple-choice
item format. This response mode was chosen as one means of extracting additional
information from a multiple-choice item, rather than simply requiring a choice
of a single response alternative.

A major problem with probabilistic responding to multiple-choice items in
conventional paper-and-pencil test administration is that examinees do not
always follow the instructions carefully, so that the probabilities they assign
to the item responses do not always sum to 1.00. As a consequence, large amounts
of data might be lost for a given examinee. When multiple-choice items are
answered in a probabilistic mode on a computer terminal, however, the validity
of the distribution of the probabilities can be checked immediately for each
individual's responses to each test item, and invalid responses can be adjusted
until they meet the appropriate criteria.
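
The kind of immediate check that computer administration makes possible might look like the sketch below. It assumes that a valid response assigns each alternative a probability between 0 and 1 and that the probabilities sum to 1.00 within a small tolerance; the function name, the tolerance, and the message wording are hypothetical.

```python
def check_probabilities(probs, tolerance=1e-6):
    """Return a list of problems with one examinee's probabilistic response.

    An empty list means the response is acceptable as entered.
    """
    problems = []
    if any(p < 0.0 or p > 1.0 for p in probs):
        problems.append("each probability must lie between 0 and 1")
    if abs(sum(probs) - 1.0) > tolerance:
        problems.append("probabilities must sum to 1.00")
    return problems


# Hypothetical responses to a four-alternative item.
valid = check_probabilities([0.5, 0.3, 0.2, 0.0])
bad_sum = check_probabilities([0.5, 0.3, 0.3, 0.0])
```

A testing program could redisplay the item with these messages until the returned list is empty, so that no response is discarded for failing to sum correctly.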

The utility of the probabilistic response mode was examined first by com- 
paring the usefulness of different scoring formulas associated with the response 
mode. Then, the factor structure resulting from the probabilistic response mode 
was studied in comparison to the factor structure obtained from scoring the
responses dichotomously. Also examined in Research Report 83-3 were the
validities of the scores obtained from the different scoring methods, their
reliabilities, and the effects of certainty or risk-taking on the probabilistic
scores.

Thompson's (1983) study also involved the administration of items in dif- 
ferent response formats to college students. The study crossed two response 
formats (categorical and probabilistic) with two item types (multiple-choice and 
dichotomous) to obtain four different types of test items. These were (1) the
conventional multiple-choice item; (2) a probabilistic multiple-choice item, 
similar to that used in Research Report 83-3; (3) a dichotomous (yes, no) item; 
and (4) a dichotomous-probabilistic item in which an examinee answered by stat- 
ing, with a number between 0 and 100, his/her confidence that the answer to the 
question was the correct answer. Similar to Research Report 83-3, Thompson
investigated several scoring systems for the probabilistic items. In addition,
the four test item types were evaluated in terms of the intercorrelations of the 
scores they provided, their reliabilities, and their factor structures. 

One other factor related to adaptive testing examined in this project con- 
cerned the effects of test administration variables on ability test performance 
and psychological reactions to testing. This study (Research Report 81-2) in- 
vestigated the effects of two variables unique to computer administration. One 
variable was immediate knowledge of results of the correctness of each item
response during the process of test administration. The second variable, pacing
of item presentation, was concerned with whether the pace of the test
administration was controlled by the examinee or by the computer. The two
variables were studied in both computer-administered conventional and adaptive
tests. The dependent variables included ability test performance (maximum
likelihood θ estimates and proportion correct), response pattern information,
item response latencies, and psychological reactions to testing. Data were
obtained from 477 college students who were randomly assigned to the
experimental conditions.



Adaptive Achievement Testing 

1. Adaptive intersubtest branching is a feasible approach to improving the
efficiency of test administration when a test battery is adaptively
administered. This approach can reduce test battery length by 50% or more with
no appreciable effect on the psychometric characteristics of scores on the 
tests in the battery (Research Reports 79-6 and 81-4). Although the major 
reductions in test battery length were attributable to adaptive intrasub- 
test item selection, there were additional small reductions in test length 
due to intersubtest branching. Intersubtest branching also resulted in 
test battery information levels that closely approximated those of the full 
test battery, in comparison to information levels obtained solely from the 
use of adaptive intrasubtest item selection (Research Report 81-4). Re- 
sults also indicated (Research Report 81-4) that the order in which sub- 
tests were selected for intersubtest branching had no effect on either the 
efficiency of test administration or on the psychometric characteristics of
the resulting test scores.

2. The use of change scores to measure changes in achievement over time, which 
assumes that the factor underlying changes in performance is invariant, may 
be appropriate in some achievement testing environments and not in others. 
Results from college courses (Research Report 81-5) indicated that the fac- 
tor structure of measured achievement in a biology course was not the same 
prior to instruction as it was after several weeks of instruction. In a 
mathematics course, however, the factor structure of measured achievement 
did not change over a 10-week period. These results suggest that in the 
absence of information to indicate that the dimensionality of measured 
achievement does not change over time, it is inappropriate to compute sim- 
ple difference scores to measure changes in achievement levels. 

3. There was some indication (Kingsbury, 1984) that Bayesian adaptive tests 
using an individual prior achievement level estimate resulted in more
reliable change scores than were obtained from comparable conventional
achievement tests. Further research is needed, however, to investigate the
generalizability of these findings in other achievement domains.

4. Adaptive Mastery Testing (AMT) is a viable procedure for reducing test
length of mastery tests and improving the efficiency of mastery
classifications. In monte carlo simulation (Research Report 80-4) AMT achieved
the best combination of test length reduction and validity of mastery
classifications in comparison with a sequential probability ratio classification
procedure and conventional tests scored by proportion correct and IRT-based
Bayesian scoring. The advantages of AMT were most pronounced in realistic
item pools in which items varied in difficulties and discriminations. AMT
also tended to result in a more even balance of false mastery and false
non-mastery classifications in comparison to the sequential procedure.

Results from the simulation study were supported in live testing (Research 
Report 81-3). In comparison to conventional achievement tests, both fixed- 
and variable-length AMTs resulted in mastery classifications that were more
consistent with an independent mastery criterion. The average variable- 
length adaptive test was able to make a high-confidence classification for 
students using only from 2 to 5 items, thus reducing test lengths as much
as 7i\Z to 88% from the 20-item conventional test, with no loss in classifi- 
cation accuracy. 

Adaptive Ability Testing 

5. Owen's Bayesian adaptive testing strategy results in θ estimates that,
under realistic testing conditions, are biased and not of equal precision
across θ levels (Research Report 83-2 and Weiss & McBride, 1984). Only
under the unrealistic situation in which true θ was used as the prior θ did
Owen's procedure result in unbiased θ estimates and reasonably horizontal
information functions. Bias was also differentially affected by item
discriminations for variable-length tests. In addition, for these tests, test
length was an increasing function of θ. The design of these studies
allowed identification of the source of the bias to be the use of a constant
(group) prior θ estimate to begin the Bayesian adaptive testing.

6. Errors in item parameter estimates do not seriously affect the performance
of adaptive testing strategies (Crichton, 1981; Mattson, 1983). In the
3-parameter data (Crichton, 1981) using indices combined across θ levels,
when error was introduced into the separate item parameter estimates the
effects were small for errors within the usually observed range. The a and
b parameters generally had similar effects on adaptive test performance,
while errors in the c parameter had negligible effects. When errors in the
three parameters were combined, effects differed little from the case with
error in a or b except for very unrealistic levels of error. There were
no appreciable differences in susceptibility to error among the
stradaptive, maximum information, and Bayesian adaptive testing strategies.

7. When indices conditional on θ were examined in the 3-parameter data
(Crichton, 1981), Bayesian and maximum information adaptive tests were
somewhat less susceptible to errors in item parameter estimates than was
the stradaptive test. Whereas errors in estimation of the b and c
parameters had little effect on the conditional indices, estimation errors in
the a parameter resulted in the major effects on the conditional indices,
indicating that large errors in estimating a may deteriorate the performance of
the adaptive testing strategies. Even with this deterioration in
performance, however, the adaptive tests still performed better than the
conventional tests for a substantial portion of the θ range.

8. When a and b parameter estimates were allowed to correlate with each other,
as they do in many real item pools, there was no additional effect on θ
estimates beyond that due to errors in uncorrelated item parameter
estimates (Mattson, 1983).

9. Maximum likelihood θ estimation performed better than Bayesian estimation
for lesser degrees of error in the item parameters, and Bayesian estimation
was less affected by item parameter errors for more extreme levels of error,
particularly for the 1- and 2-parameter models (Mattson, 1983).

10. The 2-parameter model was least affected by errors in item parameter
estimates (Mattson, 1983). Under conditions of large errors in item parameter
estimates, the 2-parameter model performed better than the error-free cases
of 1- and 3-parameter models.

11. Multidimensionality has a more serious effect on θ estimates from maximum
information adaptive tests than do errors in item parameter estimates
(Suhadolnik & Weiss, 1985). For multidimensional structures with one or
two factors beyond the first that account for up to one-fourth the variance
of the first factor, overcoming the effects of multidimensionality would
require doubling of adaptive test length. The data also suggested that the
number of factors, and not simply the overall strength of the factor
structure, affects θ estimates, since a single factor beyond the first had less
effect than did two factors that accounted for the same amount of variance.
In general, however, adaptive testing is quite robust to multidimensional
structures of the type most frequently resulting from careful item
selection, i.e., factor structures with a strong first factor and second or
third factors that account for less than one-eighth of the variance of the
first factor.

12. Administration of multiple-choice items in a probabilistic response mode
may be a useful application of computerized test administration. Although 
items answered in a probabilistic mode did not result in higher validities 
than multiple-choice items responded to dichotomously, the probabilistic 
mode resulted in higher reliabilities and a stronger first factor (Research 
Report 83-3). Since the stronger first factor would result in higher IRT 
item discrimination parameters for these items, adaptive testing based on 
items administered probabilistically would likely be more efficient,
resulting in shorter tests or in more precise θ estimates. Additional
analyses of item formats and response modes (Thompson, 1983) showed that items
presented in a dichotomous format yielded different factor structures than 
did multiple-choice formats, but supported the higher reliabilities ob- 
served in Research Report 83-3 for the probabilistic response format. 

13. Computerized test administration variables (including adaptive vs.
conventional test type, computer- vs. self-paced item administration, and
immediate knowledge of results after each item is administered) do not have
direct effects on ability test performance, as measured by estimated θ levels
(Research Report 81-2). These test administration variables do, however,
have effects on psychological reactions to testing. Immediate knowledge of
results appears to have a standardizing effect on test anxiety and
test-taking motivation, since mean levels of anxiety and motivation were
different when knowledge of results was provided but similar when it was not.
Perceptions of test difficulty were different for adaptive and conventional
tests; students accurately perceived the conventional tests as either too
easy or too difficult, depending on their ability levels, while the
adaptive test was generally accurately perceived as being of appropriate
difficulty.

ABSTRACTS OF RESEARCH REPORTS



Research Report 79-6 
Efficiency of an Adaptive Inter-Subtest Branching Strategy 
in the Measurement of Classroom Achievement 
Kathleen A. Gialluca and David J. Weiss 
November 1979 



A real-data simulation was conducted to investigate the efficiency of an
adaptive testing strategy designed for achievement test batteries applied to a
classroom achievement test. This testing strategy combined adaptive item
selection routines both within and between the subtests of the test battery.
Comparisons were made between the conventionally-administered tests and the
simulated adaptive tests in terms of test length, psychometric information, and
correlations of achievement estimates. Design of the study also permitted (1)
separation of the effects of the adaptive intra-subtest item selection procedure and
inter-subtest branching, (2) evaluation of the effects of different intra-sub- 
test termination criteria, (3) use of classical regression equations and regres- 
sion equations corrected for errors of measurement in the predictors, and (4) 
cross-validation stability of the inter-subtest branching regression predic- 
tions. Data consisted of the responses from 1,600 students to classroom-admin- 
istered final exams in a general biology course at the University of Minnesota. 

Total test length was reduced by 16% to 30% using the adaptive intra-subtest
item selection strategy with a variable termination criterion that omits those
items providing little information to the measurement process. Subtest-length
reductions ranged from about 8% to 62%. Total test length was reduced another
1% to 5% (with subtest-length reductions of up to 53%) upon the addition of an
inter-subtest branching strategy that utilized regression equations with prior
information concerning a student's performance.

Reductions in subtest length were accomplished with virtually no loss in
psychometric information. Correlations between the Bayesian achievement estimates
from the adaptive and conventional tests were uniformly high, typically r = .90
and higher. Results showed that the use of the corrected regression equations
did little to improve the performance of the inter-subtest branching; although
the multiple correlations for the corrected equations were higher, both the in- 
formation curves and correlations of achievement estimates were generally lower. 
Cross-validation results indicated that the procedure can be used in different 
samples from the same population. 

Results from this study generally supported the generality of this adaptive 
testing strategy for reducing achievement test length with no adverse impact on
the quality of the measurements. Suggestions are made for further research with
this testing strategy. (AD A080956)



Research Report 80-4
A Comparison of Adaptive, Sequential, and Conventional Testing 
Strategies for Mastery Decisions 
G. Gage Kingsbury and David J. Weiss
November 1980 

Two procedures for making mastery decisions with variable length tests and a
conventional mastery testing procedure were compared in monte carlo simulation.
The simulation varied the characteristics of the item pool used for testing and
the maximum test length allowed. The procedures were compared in terms of the
mean test length needed to make a decision, the validity of the decisions made
by each procedure, and the types of classification errors made by each
procedure. Both of the variable test length procedures were found to result in
important reductions in mean test length from the conventional test length. The
Sequential Probability Ratio Test (SPRT) procedure resulted in greater test 
length reductions, on the average, than the Adaptive Mastery Testing (AMT) pro- 
cedure. However, the AMT procedure resulted both in more valid mastery deci- 
sions and in more balanced error rates than the SPRT procedure under all condi- 
tions. In addition, the AMT procedure produced the best combination of test 
length and validity. (AD A094478)
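
The abstract above does not spell out the decision rule of the SPRT procedure. As an illustrative sketch only, a generic Wald sequential probability ratio test on dichotomously scored items can be written as follows. The function name, the two proportion-correct hypotheses (`p_nonmaster`, `p_master`), and the error rates are assumptions introduced for the example and are not taken from Research Report 80-4.

```python
import math


def sprt_mastery(responses, p_nonmaster=0.5, p_master=0.8,
                 alpha=0.05, beta=0.05):
    """Generic Wald SPRT on 0/1 item scores (illustrative, hypothetical values).

    Returns a (decision, number_of_items_used) tuple.
    """
    # Decision thresholds on the log likelihood ratio.
    upper = math.log((1.0 - beta) / alpha)   # declare mastery at or above this
    lower = math.log(beta / (1.0 - alpha))   # declare nonmastery at or below this

    llr = 0.0
    for n, x in enumerate(responses, start=1):
        if x:
            llr += math.log(p_master / p_nonmaster)
        else:
            llr += math.log((1.0 - p_master) / (1.0 - p_nonmaster))
        if llr >= upper:
            return "master", n
        if llr <= lower:
            return "nonmaster", n
    # Item supply exhausted before either threshold was crossed.
    return "undecided", len(responses)
```

Because testing stops as soon as either threshold is crossed, test length varies from examinee to examinee, which is the property the abstract refers to when it reports reductions in mean test length.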

Research Report 80-5 
An Alternate-Forms Reliability and Concurrent Validity 
Comparison of Bayesian Adaptive and Conventional Ability Tests
G. Gage Kingsbury and David J. Weiss 
December 1980 

Two 30-item alternate forms of a conventional test and a Bayesian adaptive test
were administered by computer to 472 undergraduate psychology students. In
addition, each student completed a 120-item paper-and-pencil test, which served
as a concurrent validity criterion test, and a series of very easy questions 
designed to detect students who were not answering conscientiously. All test 
items were five-alternative multiple-choice vocabulary items. Reliability and 
concurrent validity of the two testing strategies were evaluated after the ad- 
ministration of each item for each of the tests, so that trends indicating dif- 
ferences in the testing strategies as a function of test length could be detect- 
ed. For each test, additional analyses were conducted to determine whether the 
two forms of the test were operationally alternate forms. 

Results of the analysis of alternate-forms correspondence indicated that for all 
test lengths greater than 10 items, each of the alternate forms for the two test 
types resulted in fairly constant mean ability level estimates. When the scor- 
ing procedure was equated, the mean ability levels estimated from the two forms 
of the conventional test differed to a greater extent than those estimated from 
the two forms of the Bayesian adaptive test.

The alternate-forms reliability analysis indicated that the two forms of the
Bayesian test resulted in more reliable scores than the two forms of the
conventional test for all test lengths greater than two items. This result was
observed when the conventional test was scored either by the Bayesian or
proportion-correct method.

The concurrent validity analysis showed that the conventional test produced
ability level estimates that correlated more highly with the criterion test 
scores than did the Bayesian test for all lengths greater than four items. This 
result was observed for both scoring procedures used with the conventional test. 

Limitations of the study, and the conclusions that may be drawn from it, are 
discussed. These limitations, which may have affected the results of this 
study, included possible differences in the alternate forms used within the two 
testing strategies, the relatively small calibration samples used to estimate 
the ICC parameters for the items used in the study, and method variance in the 
conventional tests. (AD A094477) 



Research Report 81-1
Review of Test Theory and Methods
David J. Weiss and Mark L. Davison
January 1981

The research literature on test theory and methods for the period 1975 through
early 1980 is critically reviewed. Research on classical test theory has
concentrated on relatively unimportant developments in reliability theory, with
some new developments and applications of generalizability theory appearing
during this period. The reliability of change or gain scores has received some
attention from the classical test theory perspective, as have the applications
of classical reliability concepts in experimental design and the analysis of
experimental data. A minor amount of research with classical models was in the
area of test-score equating. Classical item analysis procedures, however,
received little attention. A fair amount of research during the period was
devoted to different item types and test item response modes as replacements for the
ubiquitous multiple-choice item. Several types of true-false items were
proposed, and formula scoring was studied by a number of researchers in an attempt
to reduce guessing effects. The perennial topic of response option weighting
received attention, with efforts oriented toward demonstrating effects on
validity and reliability. Response modes studied included answer-until-correct,
confidence weighting, and free-response.

A number of alternatives to classical test theory were studied in an attempt to
solve some of the problems for which classical test theory has proven to be
inadequate. Research on criterion-referenced testing continued during this
period. Latent trait test theory (item response theory, or IRT) received
considerable attention. Research on the 1-parameter IRT model continued to address
problems of parameter estimation, model fit, and equating. The question of the
person-free and sample-free characteristics of this model (i.e., its robustness)
was investigated, with results generally supporting these desirable
characteristics. In addition, a special case of this model that can account for guessing
was developed, and the model was generalized and successfully applied to
polychotomous attitude types of items. Considerable research occurred on the 2- and
3-parameter IRT models. The concept of information as a replacement for
classical reliability concepts was studied, and its uses in developing parallel tests
were described. As with the 1-parameter IRT model, problems of parameter
estimation and equating were investigated. These IRT models were successfully
applied to problems of item option weighting and adaptive testing. Important
developments with these models during the period included the demonstration of
their relationship with other psychological measurement models, and methods for
determining fit of individuals to IRT models. As another alternative to
classical test theory, order models were developed and studied, and several other
models were proposed.

Validity issues were also studied during this period. A number of approaches to 
the analysis of multitrait-multimethod matrices were proposed and compared,
including some based on structural equations models. Issues of predictive
validity studied included necessary sample sizes, validity generalization, and
moderator and suppressor effects. Test fairness issues and their effects on validity
received considerable attention. Concern was with (1) bias in selection; (2) 
fairness to minorities, including differential and single-groups validity and 
comparisons of regression lines, adverse impact, and bias in test content; and 
(3) fairness to women. 

It is concluded that little of consequence was accomplished in classical test 
theory during this period. The most important developments were in alternatives 
to classical test theory, primarily item response theory. Research in this area 
resulted in data and other developments that will permit a better understanding 
of the range of applicability of these models and their potential for solving 
measurement problems not solvable by classical models. (AD A096157) 



Research Report 81-2 
Effects of Immediate Feedback and Pacing of Item Presentation 
on Ability Test Performance and Psychological Reactions to Testing
Marilyn F. Johnson, David J. Weiss, and J. Stephen Prestwood 

February 1981 

The study investigated the joint effects of knowledge of results (KR or no-KR),
pacing of item presentation (computer or self-pacing), and type of testing
strategy (50-item peaked conventional, variable-length stradaptive, or 50-item 
fixed-length stradaptive test) on ability test performance, test item response 
latency, information, and psychological reactions to testing. The psychological 
reactions to testing were obtained from Likert-type items that assessed test- 
taking anxiety, motivation, perception of difficulty, and reactions to knowledge 
of results. Data were obtained from 447 college students randomly assigned to 
one of the 12 experimental conditions. 

The results indicated that there were no effects on ability estimates due to 
knowledge of results, testing strategy, or pacing of item presentation. Al- 
though average latencies were greater on the stradaptive tests than on the con- 
ventional test, the overall testing time was not substantially longer on the 
adaptive tests and may have been a function of differences in test difficulty. 
Analysis of information values indicated higher levels of information on the 
stradaptive tests than on the conventional test. There was no statistically 
significant main effect for any of the three experimental conditions when test 
anxiety or test-taking motivation were the dependent variables, although there 
were some significant interaction effects.

These results indicate that testing conditions may interact in a complex way to
determine psychological reactions to the testing environment. The interactions
do suggest, however, a somewhat consistent standardizing effect of KR on test
anxiety and test-taking motivation. This standardizing effect of KR showed that
approximately equal levels of motivation and anxiety were reported under the
various testing conditions when KR was provided, but that mean levels of these
variables were substantially different when KR was not provided. Consistent
with theoretical expectations, the conventional test was perceived as being 
either too easy or too difficult, whereas the adaptive tests were perceived more 
often as being of appropriate difficulty. 

The results concerning the effects of KR on test performance, motivation, and 
anxiety found in this study were contrary to earlier reported findings; and dif- 
ferences in the studies are delineated. Recommendations are made concerning the 
control of specific testing conditions, such as difficulty of the test and abil- 
ity level of the examinee population, as well as suggestions for the further 
analysis of the standardizing effect of KR. (AD A097688) 



Research Report 81-3 
A Validity Comparison of Adaptive and Conventional 
Strategies for Mastery Testing 
G. Gage Kingsbury and David J. Weiss 
September 1981 

Conventional mastery tests designed to make optimal mastery classifications were 
compared with fixed-length and variable-length adaptive mastery tests in terms 
of validity of decisions with respect to an external criterion measure.
Comparisons between the testing procedures were made across five content areas in an
introductory biology course from tests administered to over 400 volunteer stu- 
dents. The criterion measure used was the student's final standing in the 
course, based on course examinations and laboratory grades. Results indicated 
that the adaptive test resulted in mastery classifications that were more con- 
sistent with final class standing than those obtained from the conventional 
test. This result was observed within individual content areas and for discrim- 
inant analysis classifications made across content areas. This result was also 
observed for two scoring procedures used with the conventional test (proportion- 
correct and Bayesian scoring). Results also indicated that there was no decre- 
ment in the performance of the adaptive test when a variable termination rule 
was implemented. This variable termination rule resulted in test lengths which 
were, on the average, 74% to 88% shorter than the original adaptive tests. Fur- 
ther analyses explicated the manner in which the adaptive tests administered 
differed from the conventional test for each content area as a function of
achievement level. This evidence was used to explain why the adaptive tests 
resulted in more valid decisions than the conventional procedure, in spite of 
the fact that the type of conventional test used here was the most informative 
test concerning the mastery cutoff. It is concluded that variable-length adap- 
tive mastery tests can provide more valid mastery classifications than "optimal" 
conventional mastery tests while reducing test length an average of 80% from the 
length of the conventional tests. (AD A106867)



Research Report 81-4
Factors Influencing the Psychometric Characteristics of an
Adaptive Testing Strategy for Test Batteries
Vincent A. Maurelli and David J. Weiss
November 1981 

A monte carlo simulation was conducted to assess the effects in an adaptive
testing strategy for test batteries of varying subtest order, subtest
termination criterion, and variable versus fixed entry on the psychometric properties
of an existent achievement test battery. Comparisons were made among
conventionally administered tests and adaptive tests using adaptive intra-subtest item
selection with and without inter-subtest branching. Data consisted of responses
of 300 simulees to a 201-item achievement test battery. Mean test battery
length was reduced by 42.5% to 52.3% using adaptive intra-subtest item
selection with variable termination. Reductions in mean subtest lengths ranged from
27% to 67%. When inter-subtest branching was added, additional test length re- 
ductions of 1% to 2% were observed for individual subtests. The reductions in 
test length were achieved with no significant loss of fidelity or psychometric 
information. The addition of inter-subtest branching resulted in levels of mean 
test battery information more similar to those of the full test battery, even
with mean test battery reductions of 50% in number of items administered. Sub- 
test order was shown to have no effect on the evaluative criteria employed. The 
results generally supported previous studies of this adaptive testing strategy. 
Suggestions for future research are presented. (AD A109666) 



Research Report 81-5 
Dimensionality of Measured Achievement Over Time 
Kathleen A. Gialluca and David J. Weiss 
December 1981 



Some type of difference or change score is frequently used to quantify the 
effects of experimental treatments and educational programs on individuals and 
on groups of individuals. Whether the change measurement involves the use of 
simple difference scores, their derivatives, or some more complex methodological 
design, the measurement process itself assumes that the treatment or instruction 
results in higher levels of the originally measured variable and that the only 
change that occurs is a quantitative one. If this assumption is not met, then 
the computation of any type of difference score is inappropriate and the scores
themselves are useless for measuring growth or change. 

Two studies investigated the tenability of the assumption that classroom
instruction results in increases in students' achievement levels while the
qualitative nature of that achievement remains constant across time. The data
utilized were the item responses to tests in basic mathematics and in general
biology administered as pretests and after instruction to students enrolled in those
courses. 



Results indicated that this assumption was not tenable in the biology data set,
where increases in mean achievement level were accompanied by corresponding
changes in the factor structure underlying the item responses. For the
mathematics data, however, there was no such violation of the assumption: As student
achievement levels increased, the underlying factor structure remained unchanged.
The implications of these results for psychology, education, and program 
evaluation are noted. (AD A110955) 



Research Report 83-2 
Bias and Information of Bayesian Adaptive Testing 
David J. Weiss and James R. McBride 
March 1983 

Monte carlo simulation was used to investigate score bias and information
characteristics of Owen's Bayesian adaptive testing strategy, and to examine
possible causes of score bias. Factors investigated in three related studies
included effects of an accurate prior θ estimate, effects of item discrimination, and
effects of fixed vs. variable test length. Data were generated from a
three-parameter logistic model for 3,100 simulees in each of eight data sets; Bayesian
adaptive tests were administered, drawing items from a "perfect" item pool.
Results showed that the Bayesian adaptive test yielded unbiased θ estimates and
relatively flat information functions only in the unrealistic situation in which
an accurate prior θ estimate was used. When a more realistic constant prior θ
estimate was used with a fixed test length, severe bias was observed that varied
with item discrimination. A different pattern of bias was observed with
variable test length and a constant prior. Information curves for the constant prior
conditions generally became more peaked and asymmetric with increasing item
discrimination. In the variable test length condition the test length required to
achieve a specified level of the posterior variance of θ estimates was an
increasing function of θ level. These results indicate that θ estimates from
Owen's Bayesian adaptive testing method are affected by the prior θ estimate
used and that the method does not provide measurements that are unbiased and
equiprecise except under the unrealistic condition of an accurate prior θ
estimate. (AD A129280)
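
The three-parameter logistic model and the item information function mentioned in this abstract are not written out in the report. A minimal sketch of their standard textbook forms is given below. The scaling constant `D = 1.7` and the function names are assumptions made for the example; the report does not state which scaling or parameter values were used.

```python
import math

D = 1.7  # conventional logistic scaling constant (assumed, not from the report)


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3-parameter logistic model.

    a = discrimination, b = difficulty, c = lower asymptote (guessing).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def info_3pl(theta, a, b, c):
    """Item information at theta for a 3-parameter logistic item."""
    p = p_3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2
```

With c = 0, doubling the discrimination quadruples the information at θ = b but lowers it far from b. This gives one way to see why information curves become more peaked as item discrimination increases.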



Research Report 83-3 
Effect of Examinee Certainty on Probabilistic Test Scores 
and a Comparison of Scoring Methods for Probabilistic Responses 
Debra Suhadolnik and David J. Weiss 
July 1983 

The present study was an attempt to alleviate some of the difficulties inherent 
in multiple-choice items by having examinees respond to multiple-choice items in 
a probabilistic manner. Using this format, examinees are able to respond to 
each alternative and to provide indications of any partial knowledge they may 
possess concerning the item. The items used in this study were 30 multiple- 
choice analogy items. Examinees were asked to distribute 100 points among the 
four alternatives for each item according to how confident they were that each 
alternative was the correct answer. Each item was scored using five different 
scoring formulas. Three of these scoring formulas (the spherical, quadratic,
and truncated log scoring methods) were reproducing scoring systems. The fourth
scoring method used the probability assigned to the correct alternative as the 
item score, and the fifth used a function of the absolute difference between the 
correct response vector for the four alternatives and the actual points assigned
to each alternative as the item score. Total test scores for all of the scoring
methods were obtained by summing individual item scores. 

Several studies using probabilistic response methods have shown the effect of a 
response-style variable, called certainty or risk taking, on scores obtained 
from probabilistic responses. Results from this study showed a small effect of 
certainty on the probabilistic scores in terms of the validity of the scores but 
no effect at all on the factor structure or internal consistency of the scores. 
Once the effect of certainty on the probabilistic scores had been ruled out, the 
five scoring formulas were compared in terms of validity, reliability, and
factor structure. There were no differences in the validity of the scores from the
different methods, but scores obtained from the two scoring formulas that were
not reproducing scoring systems were more reliable and had stronger first
factors than the scores obtained using the reproducing scoring systems. For
practical use, however, the reproducing scoring systems may have an advantage
because they maximize examinees' scores when examinees respond honestly, while
honest responses will not necessarily maximize an examinee's score with the
other two methods. If a reproducing scoring system is used for this reason, the
spherical scoring formula is recommended, since it was the most internally con- 
sistent and showed the strongest first factor of the reproducing scoring sys- 
tems. 
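
The abstract names the spherical, quadratic, and truncated log scoring methods but does not give their formulas. The sketch below uses common textbook forms of these rules, together with the "probability assigned to the correct alternative" score, to illustrate what "reproducing" means: the expected score is highest when the reported probabilities equal the examinee's actual beliefs. The exact formulas, the truncation floor of 0.01, and all function names are assumptions for illustration and may differ from those used in Research Report 83-3. Responses are expressed as proportions rather than as points out of 100.

```python
import math


def spherical(p, correct):
    """Spherical score: reported probability of the key over the vector norm."""
    return p[correct] / math.sqrt(sum(x * x for x in p))


def quadratic(p, correct):
    """Quadratic score: 2 * p[key] minus the sum of squared probabilities."""
    return 2.0 * p[correct] - sum(x * x for x in p)


def truncated_log(p, correct, floor=0.01):
    """Log score, truncated at an assumed floor so that log(0) never occurs."""
    return math.log(max(p[correct], floor))


def linear(p, correct):
    """Probability assigned to the correct alternative (not reproducing)."""
    return p[correct]


def expected_score(rule, belief, report):
    """Expected item score when `belief` is the examinee's true probability
    that each alternative is correct and `report` is the response given."""
    return sum(belief[k] * rule(report, k) for k in range(len(belief)))
```

For an examinee whose beliefs are (.7, .1, .1, .1), the three reproducing rules give a higher expected score for the honest response than for either an all-or-none or an evenly spread response. The linear score instead rewards placing all points on the most likely alternative. The truncated log rule is reproducing only for reported probabilities above the floor.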